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Out-of-sample generalizations for supervised 
manifold learning for classification 
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Abstract —Supervised manifold learning methods for data 
classification map data samples residing in a high-dimensional 
ambient space to a lower-dimensional domain in a structure¬ 
preserving way, while enhancing the separation between different 
classes in the learned embedding. Most nonlinear supervised 
manifold learning methods compute the embedding of the man¬ 
ifolds only at the initially available training points, while the 
generalization of the embedding to novel points, known as the 
out-of-sample extension problem in manifold learning, becomes 
especially important in classification applications. In this work, 
we propose a semi-supervised method for building an interpola¬ 
tion function that provides an out-of-sample extension for general 
supervised manifold learning algorithms studied in the context 
of classification. The proposed algorithm computes a radial basis 
function (RBF) interpolator that minimizes an objective function 
consisting of the total embedding error of unlabeled test samples, 
defined as their distance to the embeddings of the manifolds of 
their own class, as well as a regularization term that controls the 
smoothness of the interpolation function in a direction-dependent 
way. The class labels of test data and the interpolation function 
parameters are estimated jointly with a progressive procedure. 
Experimental results on face and object images demonstrate the 
potential of the proposed out-of-sample extension algorithm for 
the classification of manifold-modeled data sets. 

Index Terms —Manifold learning, dimensionality reduction, su¬ 
pervised learning, out-of-sample extensions, pattern classification. 


1. Introduction 

T he recovery of low-dimensional structures in data sets 
not only allows understanding the data but also provides 
useful representations for their treatment in several problems. 
Data classification is among the applications that benefit from 
the identification of low-dimensional structures in data. Unlike 
unsupervised manifold learning methods such as 0> 0. 

which only take the geometric structure of data samples 
into account when learning a low-dimensional embedding, 
many recent supervised manifold learning methods seek a 
representation that not only preserves the manifold structure 
in each class, but also enhances the separation between differ¬ 
ent class-representative manifolds in the learned embedding. 
Meanwhile, an important problem in data classification with 
supervised manifold learning is the generalization of the 
learned embedding to novel data samples. In this work, we 
address the problem of constructing a continuous mapping 
between the high-dimensional original data space and the 
low-dimensional space of embedding for data classification 
applications. 

Elif Vural and Christine Guillemot are with Centre de recherche IN- 
RIA Rennes - Bretagne Atlantique, France (elif.vural@inria.fr, Chris¬ 
tine. guillemot @ inria.fr). 


Supervised manifold learning methods can be categorized 
into two groups as linear and nonlinear algorithms. Linear 
methods such as 0 , 0 , 0 , 0 and @ learn a linear 
projection that maps data into a lower-dimensional space such 
that the proximity of neighboring samples from the same 
class is preserved, while the distance between samples from 
different classes is increased. Nonlinear methods such as 
have a similar classification-driven objective, while the new 
coordinates of data samples in the low-dimensional space are 
computed with a nonlinear learning process based on a graph 
representation of data. As linear dimensionality reduction 
methods compute a linear projection, they have the advantage 
that the generalization of the embedding for initially unavail¬ 
able data samples is immediate and given by the learned linear 
operator. However, with linear methods samples from different 
classes are not linearly separable in the learned embedding, 
unless they are already linearly separable in the original high¬ 
dimensional space, which is rarely the case. The separation 
between different classes is an important factor that influences 
the performance of classification. Nonlinear dimensionality 
reduction methods achieve a better separation as a result of 
their relative fiexibility in learning the coordinates. In fact, 
nonlinear methods such as or nonlinear adaptations of 
the above linear methods, typically learn data representations 
where different classes become even linearly separable. How¬ 
ever, one difficulty of using nonlinear methods is that they 
compute an embedding only in a pointwise manner, i.e., data 
coordinates in the low-dimensional domain are computed only 
for the initially available training data and are not generalizable 
to the test data in a straightforward way. Hence, an important 
issue that needs to be addressed in order to benefit from 
nonlinear manifold learning methods in classification is the 
generalization of the embedding to novel data samples. 

The generalization of the learned embedding to new samples 
is referred to as the out-of-sample extension problem in mani¬ 
fold learning. Several previous works have addressed the out- 
of-sample problem. The study in |T0| focuses on the extension 
of manifold learning methods that compute data coordinates 
in the form of the eigenvectors of a data kernel matrix. It is 
shown that in such a setting the Nystrom method can be used 
to compute eigenfunctions that coincide with the eigenvectors 
on the training samples and generalize them to the continuous 
domain. In fact, the out-of-sample extension with the Nystrom 
formula as proposed in can also be derived from the kernel 
ridge regression framework, by removing the regularization 
term and imposing the constraint that the data coordinates of 
training samples be given by the eigenvectors of the data kernel 
matrix (TT). Next, several out-of-sample extension algorithms 
rely on the construction of an interpolation function between 
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the high- and low-dimensional domains. Some families of 
interpolation functions used in manifold learning extensions 
are polynomials sparse linear combinations of functions 
in a reproducing kernel Hilbert space (RKHS) GD , and sparse 
grid functions |T3| . In the out-of-sample extension of 
general manifold learning methods is achieved by computing 
a local projection of the high-dimensional space to the low¬ 
dimensional domain with a similarity transformation of the 
local PC A bases. There are also some extension methods 
designed for particular manifold learning algorithms. The 
study in GD proposes an out-of-sample generalization of the 
multidimensional scaling (MDS) method, which is based on 
an interpretation of MDS as a least squares problem. Similarly, 
the method proposed in |T^ presents a generalization for 
maximum variance unfolding GD 

Meanwhile, all of the above methods address the out-of- 
sample extension problem in an unsupervised setting, i.e., no 
class label information of input data samples is used. In a 
classification problem, on the other hand, different classes are 
often assumed to lie on different manifolds, e.g., in a face 
recognition problem, the face images of each individual form a 
different manifold, and supervised manifold learning methods 
map these class-representative manifolds to different manifolds 
in the low-dimensional domain. Therefore, class labels of data 
samples and the fact that different classes are concentrated 
around different low-dimensional structures should be taken 
into account when constructing an out-of-sample extension for 
classification applications. Besides this, many of the above 
unsupervised extension methods are even not applicable in 
the supervised setting. For instance, the popular Nystrom 
extension G9 considers embeddings given by the eigenvectors 
of a symmetric kernel matrix. Then, in order to embed a novel 
point, the kernel is evaluated between the novel point and each 
training point. Meanwhile, in supervised manifold learning, the 
value of the kernel depends not only on data sample pairs but 
also on the class labels of the samples. The kernel usually 
takes positive values for sample pairs from the same class and 
negative values for those from different classes, e.g., as in 
Hence, the Nystrom method does not have a straightforward 
generalization for supervised manifold learning. 

In this paper, we propose a method for constructing out- 
of-sample generalizations of supervised manifold learning 
algorithms for classification. In order to extend the embedding 
(learned with any supervised algorithm) to novel points, we 
compute a radial basis function (RBF) interpolation function 
from the high-dimensional space to the low-dimensional one. 
We optimize the parameters of the interpolation function such 
that it maps initially unavailable test points as close as possible 
to the embeddings of the manifolds of their own class in the 
low-dimensional domain. This is achieved with a progressive 
estimation of the class labels of test points while gradually up¬ 
dating the parameters of the interpolation function at the same 
time. As the proposed method makes use of test points in the 
construction of the interpolation function, it can be considered 
as a semi-supervised solution for obtaining an out-of-sample 
extension. Another criterion that is taken into account in the 
optimization of the parameters of the interpolation function 
is the regularity of the interpolation function. We find that 


the regularity of the interpolation function can be adjusted by 
optimizing its scale parameters to minimize a regularization 
objective, which controls the magnitude of the interpolation 
function gradient, while allowing sharp directional derivatives 
to occur only along the class separation boundaries in order 
to attain an effective separation between different classes. 
Experimentation on several image data sets shows that the 
proposed method can be effectively used in the classification 
of data of intrinsic low dimension. The proposed out-of-sample 
extension method is general and can be coupled with any 
supervised manifold learning algorithm. 

The rest of the paper is organized as follows. In Section 
[n| we briefly overview some supervised manifold learning 
methods and formulate the out-of-sample extension prob¬ 
lem. In Section III we describe the proposed method for 
classification-driven out-of-sample extensions for supervised 
manifold learning. In Section IV we discuss some aspects 
of the proposed algorithm, where we analyze its complexity 
and interpret it within the context of regression. In Section [V] 
we present some experimental results and in Section |Vl| we 
conclude. 


H. Overview oe manieold learning 
A. Manifold learning for classification 

Given a set of data samples {xi}f^^ C that reside in 
a high-dimensional space manifold learning computes a 
new representation i/i G in a lower-dimensional domain 
for each data sample Xi . Manifold learning methods generally 
assume that the samples {xi} come from a model of low intrin¬ 
sic dimension and search for an embedding that significantly 
reduces the dimension of the data (d ^ n) while preserving 
certain geometric properties. Different methods target different 
objectives in computing the embedding. The ISOMAP method 
computes an embedding such that Euclidean distances in 
the low-dimensional domain are proportional to the geodesic 
distances in the original domain |[T|, while LLE looks for an 
embedding that preserves local reconstruction weights of data 
samples in the original domain Q. The Laplacian eigenmaps 
algorithm |[^ first constructs a graph from the data samples 
where nearest neighbors are typically connected with an edge. 
The graph Laplacian matrix is given by L = D — W, where W 
is the NxN weight matrix whose entries are usually computed 
based on a kernel Wij = K{xi^Xj), and D is a diagonal 
degree matrix given by Da = ^ij • The embedding with 
Laplacian eigenmaps is then learned by solving 

mi n triY^LY) s.t. Y^ DY = I 

YeRNxd 

where I is the identity matrix. The solution to this problem 
is given by the d eigenvectors corresponding to the smallest 
nonzero eigenvalues of the generalized eigenvalue problem 
Lz = \Dz, where the coordinate vector yi for each data 
sample Xi is given by the i-th row of Y. Intuitively, such an 
embedding seeks data coordinates that have a slow variation 
on the data graph, i.e., two neighboring points on the graph are 
mapped to nearby coordinates. There exist linear versions of 
the Laplacian eigenmaps method as well. The above problem 
is solved under the constraint that Y be given by a linear 
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projection of X onto in fls) , which is applied to face 
recognitions problems in p9| and p0| . 

Recently, many extensions have been proposed for man¬ 
ifold learning for classification. Most of these methods are 
supervised adaptations of the Laplacian eigenmaps algorithm. 
In order to achieve a good separation between the classes, 
an embedding is sought where data coordinates vary slowly 
between neighboring samples of the same class and change 
rapidly between neighboring samples of different classes. The 
algorithm proposed in Q formalizes this idea by defining two 
graphs that respectively capture the within-class and between- 
class neighborhoods. Denoting the weight matrices of these 
two graphs by and and the corresponding Laplacians 
by Lw and the method seeks an embedding that solves 


min L^Y) - LbY) s.t. Y'^D^Y = I 

YeRNxd 

( 1 ) 

where /i > 0. The method proposed in 0 employs an 
alternative Fisher-like formulation for the supervised manifold 
learning problem where the embedding is obtained by solving 


max 

z 


Li)Z 

z'^L.nZ 


( 2 ) 


However, the problem is solved under the constraint z^ = 
v^X in order to obtain a linear embedding, where X = 
[xi.. .xn] is the n X data matrix and v G defines 

a projection. Variations over this formulation can be found in 
several other works such as 0,0,0,0,0 and l l^ . 


B. Out-of-sample extensions 

While most manifold learning methods learn the coordinates 
of only initially available data samples, in many applications 
including classification, the generalization of the learned em¬ 
bedding to the whole data space is an important problem. 
Given a set of data samples G in the high¬ 

dimensional ambient space and their corresponding coordi¬ 
nates {yi}fLi C in a low-dimensional space, the out-of- 
sample extension problem consists of constructing a mapping 
/ : ^ such that / gives the learned embedding 

f{xi) = Hi on the available data samples while generalizing 
the embedding to all points in 

A popular out-of-sample generalization algorithm is pre¬ 
sented in |T0| , based on the Nystrom formula. This method 
proposes a generalization for manifold learning algorithms 
that compute the coordinates based on an eigenvalue problem 
My^ = Xky^, where the symmetric matrix M is given 
by a data-dependent kernel Mij = M{xi,Xj) and y^ is 
the kih eigenvector of M, which defines the kih dimension 
of the data coordinates. The exact expression of the kernel 
matrix M as a function of the weight matrix W depends 
on the manifold learning algorithm to be generalized, as 
different algorithms target different objectives. The out-of- 
sample extension proposed in |T0| is then given by the function 
/G) = ■ ■ ■ /'^G)]> where 

1 ^ 

= (3) 


and yi = [yj .. .yf] are the coordinates of the embedding 
of Xi in This defines a continuous function that co¬ 
incides with the embedding at the initially available points 
f{xi) = yi. While this popular method provides straightfor¬ 
ward generalizations of many manifold learning algorithms 
such as ISOMAP, LLE, and Laplacian eigenmaps, it cannot 
be used with most supervised manifold learning methods. The 
reason is that, although the data kernel matrix M is assumed 
to be a general symmetric matrix (not necessarily positive 
semi-definite) in the entries of this matrix in supervised 
methods are not only dependent on the data samples Xi, but 
also on their class labels. For instance, in ([T]), the kernel matrix 
M is a normalized version of the matrix which is 

determined with reject to data class labels. In this case, the 
Nystrom formula ^ cannot be applied as M{x^Xi) is not 
priorly known for a test sample of unknown class. 

Several out-of-sample extension methods such as those 
based on fitting a particular type of interpolation function as in 
(n), fig, and (Tg can be applied for generalizing supervised 
embeddings by fitting a function / to the priorly learned 
(xi^yi) pairs. However, as this gives a generalized embedding 
based only on an approximation objective that does not take 
into account the class information of data, its classification 
performance is likely to be suboptimal. 

In this paper, we propose to learn an interpolation function 
in an application-aware manner. The proposed method not 
only makes use of the initially available training samples 
(xi^yi), but also exploits the test samples of unknown class 
in the learning, by jointly estimating the interpolation function 
parameters and the class labels of test samples. We describe 
this method in Section Jn] 

III. OUT-OF-SAMPLE EXTENSIONS FOR CLASSIFICATION 
A. Formulation of the out-of-sample problem 

We begin with a formalization of the classification-based 
out-of-sample extension problem. Let AI 2 , • • • AIm C 
be M compact manifolds representing M different classes 
in the original ambient space Let S be an embedding of 
the manifolds {Aim} in a lower-dimensional space 

m 

such that each manifold Mm C is mapped to £{Mm) G 
The restriction of £ to each manifold is assumed to be 
continuous and the embeddings of different manifolds are 
assumed to be disjoint. We consider that the data samples 
are drawn from a probability measure u on such that the 
samples of each class m are concentrated around the manifold 
Mm- Let Um denote the probability measure of class m having 
a support region Sm in where Mm ^ Sm- We denote by 
^ projection of the point x onto the manifold Mm^ 
which is a point on Mm of minimal distance to x 

\\x-PMr^{x)\\= min \\x-x'\\. 

x'eMm 

Here || • || denotes the usual ^ 2 -norm in the Euclidean space. 

As for the solution set of interpolation functions, we con¬ 
sider a compact set H of differentiable functions from 



4 


to M, where / : ^ belongs to given by the d- 

dimensional Cartesian product of An interpolation function 
that is suitable for classification should map points x from 
class m as close as possible to the set £{Mm)^ so that they 
can be correctly classified with respect to their representation 
in We thus define the embedding error of / with respect to 
its deviation from the embedding of the projection of a point 
onto the manifold of its own class. The total embedding error 
of a function / over all classes is then given by 




\\f{x) - £ {PMAx))f (4) 


The distributions I'rn are usually not explicitly known in 
practice. In order to avoid overfitting to training data, it is 
useful to enforce some regularity properties for the interpola¬ 
tion function /. A smoothness constraint can be imposed by 
controlling the total gradient magnitude. Meanwhile, nonlinear 
supervised manifold learning methods, whose extensions are 
targeted in this paper, typically learn a representation where 
different classes are likely to become linearly separable in 
the learned embedding. The coordinates defining the em¬ 
bedding are orthogonal when given by the eigenvectors of 
a symmetric kernel, or “nearly orthogonal” when given by 
the generalized eigenvalue problem ©• Different groups of 
classes are then expected to become separable along different 
dimensions of the learned embedding, which is also easy to 
confirm experimentally (see, e.g.. Figure 3(a)| ). Thus, when 
learning an interpolation function /, in order to enhance the 
separation between different classes, it is desirable to have 
sufficiently strong derivatives along the directions defining 
the boundaries of the distributions of different classes in the 
ambient space, especially for the components of / for 
which the considered classes are separable at dimension k 
of This is illustrated in Figure Given a dimension 
/c G {1,..., d}, let 


= {(m,p) :mdix£^ {Mm) < minf^(A^p) 
or max(Alp) < min£^{Mm)} 


denote the set of indices of manifold pairs whose embeddings 
are separable at dimension k, where £^{Mm) denotes the kth 
dimension of the embedding £{Mm)- Let denote the 

directional derivative of along the direction v. For a point 
X from class m, let Up{x) := {x — PmAx))/\\x - PmAx)\\ 
denote the unit vector corresponding to the direction of 
projection of x onto the manifold Mp of class p, where 
p ^ m. Then, we would like to learn an interpolation function 
/ such that the directions along which has the strongest 
derivatives coincide with the directions of the projections of 
points onto the manifolds of other classes. The total derivative 
magnitude along the directions of projection onto other classes, 
normalized by the average derivative magnitude is given by 


D{f) 




||v.,(.) fHx)\ 
E,,||V„/'=(x)|| 


dUm{x) 


( 5 ) 


where is assumed to have nowhere vanishing gradient 
and E^||V^;/^(x)|| denotes the mean directional derivative 
magnitude, induced from the overall distribution of data over 




Fig. 1. Illustration of supervised manifold learning and out-of-sample 
interpolation. Manifolds M.i and M .2 representing two different classes are 
embedded in a lower-dimensional domain such that they are separable along 
dimension k = 2, but not along dimension k = 1. The second component 
of the interpolation function f{x) = [f^{x) /“^{x )... f^{x)] should 
then have a sufficiently strong directional derivative Vuf^ix) along direction 
u at X, in order to reinforce the separation achieved by the supervised 
embedding, while it should vary smoothy along direction v. Meanwhile, 
the first component f^{x) of the interpolation function should have a slow 
variation along both directions u and v as the embeddings S(A4i) and 
^{M. 2 ) are not separable along dimension k = 1. 

all classes Q The normalization of the directional derivative 
by the average derivative aims to measure the derivative 
magnitude along separation boundaries relatively to the mean 
derivative magnitude. 

While the presence of sufficiently strong directional deriva¬ 
tives along separation boundaries is expected to enhance the 
separation between classes with the learned function, it is 
useful to control the smoothness of the interpolation func¬ 
tion by preventing it from attaining arbitrarily high gradient 
magnitudes. We thus define the total gradient magnitude 


G{P) ■=Y.I 

m ^rn 


E«I|V. /''(a:)|| 


dVm{x) 


( 6 ) 


which is also normalized by the average directional derivative 
so that it is comparable to the term in 

From 0 and one can define an overall regularization 
objective R that increases with the total gradient magnitude G 
and decreases with the directional derivative magnitude along 
separation boundaries D. One way to define the regularization 
objective R is as 


d 

Rif) = E (Gif) - 

k=l 

where A > 0. 

Finally, combining the embedding error in 0 and the 
above regularization term, we formulate the search of the 
interpolation function / as the optimization of the following 

^ The mean directional derivative can be formally defined as follows. Let 
Pu denote the probability density function corresponding to the probability 
measure u, and let St{x) C denote the (n — 1)-dimensional sphere of 
radius t centered at x. The expected value of the directional derivative of 
at X is then given by 


E,,||V„ p{x)\\ = \im 


fst(x) P-'G + II P(x)lldS 

fst(x) + dS 


where v denotes the unit surface normal in the direction of the surface element 
dS. 
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problem 

/ = arg min (V [ \\h{x) -£ {Pm^{x)) f dvm{x) 

cy.R(^h^^ 

where a > 0. A solution to the above problem exists as 
is compact and the objective function is continuous. 

In a real setting, the distributions Urn of data are often not 
explicitly known and one has access to a set of samples A' = 
drawn from these distributions. Let Ci G M} 

denote the class label of the data sample Xi, and M{xi) be 
the set of nearest neighbors of Xi in A’ (which can be chosen 
for instance as the if-nearest neighbors of Xi with respect to 
the Euclidean distance in Let 

denote the set of unit vectors that indicate the directions of 
the neighbors of Xi, and 

np{xi) = I : Xj e M{xi),Cj = p \ 

denote the set of unit directions given by the nearest neighbors 
of Xi within class p. We can then define the empirical 
embedding error E{f) as 


^(/)=E E \\f{Xi)-£{PMr.{Xi))\ 


m i: Ci=m 


and the empirical counterpart R(yf) of the regularization term 
R{f) as 


RU) = E 


k=l 


where 


G{p) ■■= E 


\WfHxi)\\ 


Hxi)\-^E.eM.,)\\^vfHxi)\\ 






( 7 ) 


( 8 ) 


( 9 ) 


In the above expressions, | • | denotes the cardinality of a 
set, and the mean directional derivative Ey\\\/v f^{xi)\\ is 
approximated by the average derivative of f^{xi) along the 
directions of the neighbors n{xi) of Xi. In the definition of 
we approximate the derivative S/up{xi) along 

the direction of the projection of Xi onto M.p with the average 
derivative along the directions of the nearest neighbors of Xi 
within class p, where np{xi) is assumed to be non-empty for 
all Xi and p. 

Now, having defined the objective function in the empirical 
setting, we come back to the actual manifold learning problem. 
In practice, the manifolds M.m are usually not explicitly 
known, and manifold learning methods compute an embedding 
for only the initially available training samples. Let us denote 
by At = {xi}^^ C A' the set of training samples with 


known class labels (where N < Q), for which an embedding 
yr = {yi}fLi is priorly computed with a supervised manifold 
learning algorithm. We assume that there exist embeddings 
£{Mm) of tho manifolds Mm such that the embeddings 
{Ui} of the samples of each class m are concentrated around 
£(Mm)- Although the training samples available in practice 
are not guaranteed to lie exactly on a manifold in general 
(due to noise, imprecise measurements, or several sources of 
deviation from the assumed model), we make the following 
approximations for a sample Xi G At of class m for the 
simplicity of computations: 


PMraiXi) ~ Xu £{PMm{Xi)) « VU 

The embedding error E{f) can then be decomposed as 
Mf)=ET{f)EEo{f) where 


N 

ET{f) = J2U{xi)-yif 

is the embedding error of the training samples AV and 

Q 

Eo{f)= E \\fixi)-£{PMM)f 
Q 

= E E \\fixi)-£iPMM)f 

m i=N+l 
Ci=m 

is the embedding error of the other samples than the training 
samples (test samples in A' \ AV). 

In the generalization of an embedding, one may wish to 
strictly preserve the learned coordinates of the training data 
f{xi) = Hi. We can thus formulate the out-of-sample extension 
problem for supervised manifold learning as follows: 

/ = arg min Eo{h) + aRih) s.t. EtW = 0. (10) 

hen^ 


In the above problem, if there are observations in X with 
unknown class labels, one needs to estimate the class labels 
Q for < i < Q. In the rest the paper, we focus on this 
general case. In Section |III-B[ we describe an algorithm that 
computes an interpolation function with a joint and progressive 
estimation of the function parameters and the class labels of 
data. 


B. Construction of the interpolation function 

In this study, we select the set R of interpolation functions 
for the out-of-sample extension problem as the radial basis 
functions (RBFs) 

where f :R ^ M+ is a differentiable kernel. The coefficients 
Cl, the kernel centers ai, and the scale parameters ai are 
assumed to lie in some compact domains in M, and IR+, 
respectively. The Gaussian function 0(t) = e~^ is a common 
choice for the RBF kernel due to its desirable properties such 
as its smoothness and rapid decay, which we also adopt in this 
work. 
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In our problem, we look for a function / = 

[f^{x) ... /^(x)] : ^ such that each dimension 

of / is given by 


f\x)=Y,cU 

1=1 


af )■ 


( 11 ) 


The construction of the interpolation function / is thus equiv¬ 
alent to the determination of the parameters cf, af, and 
the number of terms L. 

In the optimization problem in (Gif, the evaluation of 
Eo{h) requires the knowledge of the class labels Ci of Xi 
for 2 = - 1 - 1,..., Q, which are unavailable in the beginning. 

We propose to solve this problem with an iterative algorithm 
that progressively estimates the class labels and constructs 
a sequence of interpolation functions /i,..., /r, • • • /i? in an 
alternating scheme as described below. 

In iteration r of the algorithm, we construct a function 
with Lr terms. When fitting an RBF interpolation function to 
data, it is common practice to assign kernel centers as data 
points. In iteration r, we select the kernel centers af = Xr, 
as a subset of data samples {xri}ili C A', where the index 
sequence depends on the iteration r and denotes the 

indices of the data samples {xi} chosen as kernel centers. 
Throughout the iterations, the number of terms Lr is increased 
gradually such that N = Li <1/2 < ... < Lji = Q. Once 
the kernel centers af are fixed, the interpolation function fr in 
iteration r, characterized by the coefficients {cf} and the scale 
parameters {(jf}, / = Lr, /c = 1,..., d, is obtained by 

solving the problem 


min 

{cf}cB, {crf}cA 


^S(/) + ai?(/) 


s.t. Erif) = 0 (12) 


where B c R and A c are compact parameter domains 
(sufficiently large so that the constraint Erif) = 0 can be 
satisfied) and 


Lr 

<(/) = E E (13) 

m l=N-\-l 

Cri=m 

The problem has a solution as a continuous function over 
a compact domain attains its minimum. 

Before discussing the solution of ( p^ , we first give an 
overview of the method. In iteration r, once the interpolation 
function fr is computed by solving we estimate the class 
label of each point Xi for N ^ 1 < i < Q hy assigning it the 
class label of the training point xj such that fr{xj) is the 
closest to fr{xi) in the low-dimensional domain 


Ci = Cj : j = argmin \\fr{xq) - fr{xi)\\, 1 < q < N. 

(14) 

At the same time, a confidence score is assigned to each 
estimate Ci by comparing the distance of Xi to its nearest 
neighbor Xj within all classes and to its nearest neighbor Xj^ 
among the classes other than Cj : 

^ WfjXj') - f{Xi)\\ . 

ll/(^.)-/«ll ■ 

f = argmm \\fr{xq) - fr(xi)\\, l<q<N,Cn^ Cj. 

(15) 


The confidence score i^i thus decreases with the “ambiguity” 
in assigning Xi the class label Ci with respect to the nearest- 
neighbor decision rule in via fr. 

The confidence scores jii obtained in an iteration are then 
used in the next iteration for the selection of the kernel centers. 
In iteration r, the kernel centers are determined based on the 
confidence scores computed in the previous iteration r — 1 
as follows. The first N kernel centers {anil = {Xn}li 
consist of the training set TV, i.e., n = / for / = 1,..., A^. 
The remaining kernel centers 

first Lr — N points in A' \ AV of highest confidence scores. 
The alternating stages of computing fr and estimating the class 
labels Ci and obtaining the confidence scores jii are repeated 
until the last iteration R, where all data samples are included 
in the set of kernel centers {af}f^^ = A'. The interpolation 
function / is then given by fR, and the class labels of the 
points in A' are obtained by estimating them with the final 
interpolation function with respect to Gl- 

We now discuss the solution of the problem ( p^ . First, 
observe that for any Lr input data pairs (xi^i/i) G x 
and any choice of the scale parameters erf, one can find 
interpolation functions f^ of L = Lr terms that satisfy 
f{xi) = Hi as follows. Setting af = xi for I = 1,...,!/^, 
the constraints f^{xi) = yf yield the linear system 

= / ( 16 ) 


where = [c\ .. .c\Y the coefficient vector, = 
[yf .. .y\ Y consists of the kih dimensions of {yi}, and 


= 


\Xi - xi\ 

af 


(17) 


is the matrix of RBFs evaluated at data points Xi. The square 
matrix is invertible if the points Xi are distinct and cj) is 
chosen as the Gaussian kernel The system ( p^ then has a 
unique solution = (^^)“^^Ewhich satisfies f^{xi) = yf. 

In iteration r = 1, we have Li = N and all kernel centers 
are training points. In this case the embedding error in ( p^ is 
^oif) ~ optimization problem is reduced to 


Due to the above discussion, the constraint ET{f) = 0 
can be satisfied for any choice of scale parameters af by 
setting the coefficients as = ($^)“^^^. This reduces the 
problem to the minimization of the regularization term R{f) 
by optimizing the scale parameters {af} under the constraint 

min R{f) (18) 


where the summations in the terms ^ and (|^ of R{f) run 
over the set of training samples AV. The regularization term is 
a non-convex function of the scale parameters [af] with nu¬ 
merous extrema. Meanwhile, we have experimentally observed 
that the variation of R{f) with is quite regular when all 
scale parameters erf, I = 1,..., Li, in each dimension k are 
set to a common value a ^. Moreover, setting all scale parame¬ 
ters to the same value across each dimension also simplifies the 





7 


optimization problem, as it reduces the number of optimization 
variables from Lik to k. We thus propose to solve the problem 
under the constraint af = for / = 1,..., Li. Since 
the form of R{f) in 0 is decomposable into its components 
in different dimensions, the scale parameter of dimension k is 
given by 

min (dif) - XD{f)) . 

It is difficult to analyze the above function theoretically. 
Meanwhile, in practice we have observed that G(/^) increases 
with monotonically. Moreover, if the underlying embedding 
obtained with supervised manifold learning provides a “bal¬ 
anced” distribution of the classes across different dimensions 
while ensuring a sufficient separation, the total directional 
derivative along class separation boundaries D{f^) first in¬ 
creases at a fast rate with at small scale values, and then it 
stagnates or the rate of increase is highly reduced. This is due 
to the fact that, when the scale parameters are too small, the 
interpolation function is too localized around kernel centers 
and its support does not cover well the whole space. Then, 
it does not have sufficiently strong derivatives along class 
separation boundaries. As increases, there typically exists 
a range for where the directional derivatives along class 
separation boundaries are relatively stronger than those along 
other directions, thanks to the underlying learned embedding 
that separates different classes and guides the interpolation 
function via the condition f^{xi) = imposed on training 
samples. This range for the scale parameters coincides in 
general with the interval of scale parameters where a good 
classification performance is attained. If the scale parameters 
are increased beyond this range, the gradient of the function 
increases too much, resulting in an overfitting of the inter¬ 
polation function, where the advantage of having sufficiently 
strong directional derivatives along class separation boundaries 
is lost as strong derivatives appear in other directions as well 
due to overfitting. This is illustrated with a simple example 
in FigureFigure 2(a) shows two manifolds M.i^M .2 C 
representing two different classes, and four training samples 
selected from the distribution concentrated around each man¬ 
ifold. Let us consider a one-dimensional embedding of the 
manifold samples in M such that samples from Mi and M2 
are mapped respectively to 1 and —1. An ideal interpolation 
function /(x) : ^ M separating the two classes well 

in M should have gradients in the directions shown in red 
in Figure |2(a)[ which are orthogonal to the class separation 
boundary. In Figures |2(b)[2(d)| an RBF interpolation function 
/ with Gaussian kernel is fitted to the training data, and f{x) 
is plotted over the displayed region of where white and 
black colors correspond respectively to 1 and —1. The scale 
parameter is chosen as cr = 0.5, cr = 2, and cr = 6 respectively 
in Figures |2(b)f |2(d)[ The scale parameter is observed to 
be too small in Figure |2(b)| as the support of / does not 
cover the manifolds sufficiently. The scale parameter in Figure 


2(c)| yields an accurate interpolation function that separates 
the two classes well, where the directions along which / 
has strong derivatives are close to the directions shown in 
Figure |2(a) Meanwhile, the selection of a too large value for 



(a) (b) (c) (d) 


Fig. 2. Illustration of the effect of the scale parameter on the accuracy of 
the interpolation function, (a) Manifolds C representing two 

different classes and samples chosen from each class. An ideal interpolation 
function / separating the two classes well in M should have gradients in the 
directions shown in red. (b) Function / constructed with a = 0.5. The scale 
parameter is observed to be too small as the support of the function does not 
cover the manifolds well, (c) Choosing the scale as a = 2 yields a good 
interpolation function, (d) Choosing a too large scale parameter a = 6 results 
in an overfitting of the interpolation function, with large derivatives in the 
indicated directions. 


the scale parameter in Figure |2(d)| results in an overfitting 
of the interpolation function. In particular, strong directional 
derivatives are observable in directions other than the class 
separation boundary directions as well due to overfitting, e.g.. 


the directions shown in red in Figure 2(d) 

In optimizing cr^, we look for an interval where D{f^) 
is large enough while G{f^) is not too high. We set the 
weight parameter A to a value where the effects of both 
of these terms are visible, often yielding a nonmonotonic 
variation of the overall regularization term G{f^) — 
which first decreases with due to the sharp increase in 
D{f^) and then increases with due to the stagnation of 

D{f^) and the continuing increase in the first term G{f^). 
The optimal value of can then be found easily with a 
simple descent or line search algorithm by minimizing the 
one-dimensional regularization term G{f^) — \D{f^). We 
finally note that other configurations of these two terms D{f^) 
and G{f^) in a regularization objective R{f) (rather than a 
linear combination) may also be possible, depending on the 
underlying embedding. This will be discussed in more detail 
in Section]^ as well as the links between the regularization 
objective R{f) and the classification performance. 

Having examined the computation of the scale parameters 
and the coefficients of fi in iteration 1, we now discuss the 
solution of the problem ( p^ in a general iteration r. Due to the 
iterative estimation of the class labels and the calculation of 
the function parameters, the class labels Gn of the points x^ 
contributing to the embedding error are already estimated 
in the previous iteration. The manifolds Mm and the embed¬ 
ding £ are not explicitly known in the term £ (Px^(x^J). 
However, relying on a locally linear approximation of the 
manifolds, one can estimate the projection of a point x onto 
Aim as a convex combination of its nearest neighbors, which 
can then be used to compute f (Px^(xrJ)|^ Denoting the 
indices of the K nearest neighbors of x within the training 
samples of class m sls {ai}fLi, and the set of nearest neighbors 


^Note that, although the interpolation function of the previous iteration gives 
an estimate of the embedding of a point x as fr-i{x), it is more reliable 
to update the embedding by projecting x onto the manifold Aim- This is 
because the embedding fr-Rx) employs no priors on the class label of x 
and is indeed used to estimate the class label of x, while the recomputation of 
the embedding as S (Pwru (^)) estimated class label of x. In fact, 

S (PMm (^)) coincides with the value of the updated interpolation function 
fr (x) of iteration r for a: = Xri as discussed below. 
































as Mm{x) = {xai}f=i, the projection is approximated as 

K 

PM^{x) « Wj Xg, 
i=l 

where w = [wi... wk]^ is the vector of weights given by 

K 

w = aigmm\\x-^ViXaiW'^ s.t. > 0, = 1 

(19) 

which can be solved with quadratic programming. From the 
continuity assumption of the embeddings, the embedding 
£ of then estimated as 

K 

£ {PMm (^)) ~ X] Vat (20) 

i=l 

where ya^ are the coordinates of in the learned embedding 
in 

Letting = £ (^Pmc^i the total embedding error 

is given by 

Lr 

E^{f) = E^oU) + Mf) = Y, E \\fM-yr^\?■ 

m 1=1 

Cri=m 

Since in iteration r an interpolation function of Lr terms 
is constructed, for any choice of the scale parameters {crf^}, 
fitting the coefficients to the observations as 
yields E^{f) = 0, which immediately satisfies the constraint 
Erif) = 0 on training samples. It then remains to minimize 
the regularization term by optimizing the scale parameters 
as in (p^p] This concludes the description of the proposed 
method. As the proposed algorithm employs unlabeled test 
samples in learning an out-of-sample extension, we call 
it Semi-supervised Out-of-Sample Interpolation (SOSI). The 
method is summarized in Algorithm 

IV. Discussion 
A. Complexity analysis 

We now derive the complexity of the proposed method, 
which is essentially determined by the complexity of steps [TT]- 

nation of the nearest neighbors in AV for each test image is of 
complexity 0{nN), and the solution of the quadratic program 
in ( p9] ) has a polynomial-time complexity 0(poly(Ff)) in the 
number of neighbors K p4| . The complexity dK of ( |20| ) 
can be neglected as d is small. Since the embedding of the 
projection of each point in A \ Xt is computed only once 
throughout the algorithm, we get the overall complexity of 
step [ll] as 0{Q (poly(iT) + nN)) ^ 0{nQN). 

Next, step requires the evaluation of the regularization 
term R{f) at several values and the corresponding coef¬ 
ficients = {^^)~^y^. The computation of the coefficients 

^In practice the optimization of scale parameters can be omitted for r > 1 
and the scale parameters can be set to the values obtained in iteration r = 1 
in order to speed up the algorithm without much change in the performance, as 
the reoptimization of the scale parameters results in values in the vicinity 
of those obtained at iteration r = 1 in general. 


13 in the main loop of the algorithm. In stepfTT] the determi- 


Algorithm 1 Semi-supervised Out-of-Sample Interpolation 
(SOSI)_ 

1: Input: 

A’ = of labeled and unlabeled data samples 

{Ci}iLi' Class labels of training data C A, where 

N <Q. 

2: Initialization: Assign number of iterations R and number of RBF terms 
{Lr}^i in each iteration such that Li = N, Lji = Q (possibly with 
equispaced intervals between N and Q) 

3: for r = 1 do 

4: Set kernel centers af = xi for I = 1,... , N, k = 1,..., d 

5: Optimize scale parameters af of fi by minimizing R{f) subject to 

the constraints ^ p = 

6: Estimate class labels Ci and compute confidence scores [li for i = 

1,..., Q by NN classification with fi in 

7: end for 

8: for r = 2,. .., do 

9: Determine {xri}^Li such that {xri}i£i = A^ and 

are the points in A \ Ax' with highest confidence scores 
10: Set kernel centers as af = Xri for / = 1,..., Lr, k = 1,..., d 

11: Compute the embeddings of the projections of Xri on the manifolds 

as in pO) and set Hn = S (^^Mcr 

12: Optimize scale parameters of fr by minimizing R{f) subject to 

the constraints = a^, P = 

13: Update class labels Ci and confidence scores for i = 1,..., Q 

with NN classification with p in 

14: end for 
15: Output: 

Out-of-sample interpolation function f = fji : ^ given by 

{Pi}?=N-\-v Class labels of initially unlabeled data samples 


requires the solution of an Lr x Lr linear system, whose 
complexity is between 0{Lr‘^) and 0{Lr^). Then, for a 
given and the corresponding c^, we analyze the evaluation 
of L){f^). The computation of the gradient V f^{xi) is of 
complexity 0{nLr). Assuming that each training point Xi has 
around K nearest neighbors in each one of the M classes, 
the computation of the directional derivative Vuf^{xi) for all 
neighbors of a point Xi is of complexity 0{n{Lr + KM)). 
Since this is repeated for all N training points Xi, the com¬ 
plexity of computing L){f^) is of 0{nN{Lr KM)). Since 
the complexity of G{f^) is dominated by that of D{f^), 
the optimization of is of 0{Lr^ + nN{Lr + KM)). 
Performing this optimization for all d dimensions, upper 
bounding Lr by Q, and repeating this for all R iterations 
gives the complexity of step throughout the algorithm as 
0{dR{Q‘^PnN{QpKM))) ^ 0{dnRN{QpKM)). If one 
omits the reoptimization of for r > 1, the complexity of 
step 1^ is reduced to the optimization of scale parameters at 
the first iteration r = 1 and the update of the coefficients 
at every iteration, which is of 0{dnN{N -b KM) -|- dRQ^). 


Step pA] requires the evaluation of f{xi) for all Xi G X\Xt, 
which is of 0{dnQLr), and the comparison of the function 
values to those of the training points, which is of 0{dQN). 
The complexity of repeating step throughout R iterations 
is then of 0{R{dnQ‘^ -b dQN)) = 0{dnRQ‘^). Finally, com¬ 
bining the complexities of steps pTpS) we get the complexity 
of the overall algorithm as 0{dnR^). 
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B. Relation to kernel ridge regression 

In this section, we discuss how out-of-sample extensions of 
supervised manifold learning methods with RBF interpolation 
can be interpreted within the context of kernel ridge regression. 
Ridge regression is a well-known statistical method that learns 
a linear function to model the dependency between a set of 
input training points C and the associated outputs 

{yi}fLi C For each dimension of the outputs yi = 
[y] .. .^f], the algorithm looks for a linear model f^{x) = 
w^x that minimizes 

N 

i=l 

which is a slightly modified version of the least squares 
method by adding a regularization term representing the 
squared norm of the fitted linear model. Here a > 0 is a 
parameter adjusting the weight of the regularization term. An 
alternative formulation of ridge regression is proposed in p5] 
that is based on a dual version of the above problem. The 
solution of the dual problem yields the following prediction 
f^(x) of the output value for a new input sample x: 

f(x) = (yY(-f^ + a-^r^v- ( 21 ) 

Here, y^ = [^i ...^^]^ is the vector of output values for 
training samples, K G is the matrix of inner products 

of input samples whose entries are given by Kij = {xi ,Xj), 
I is the identity matrix, and v G is the vector of inner 

products of X with Xi, whose ith entry is given hy Vi = (x^Xi). 

Since this formulation only involves the inner products 
between the samples x and {xi} rather than the samples 
themselves as vectors, it permits a kernel extension of the 
regression problem, where the samples are mapped to a high¬ 
dimensional feature space F via a kernel t/i : ^ F. The in¬ 

ner products in K and v are then evaluated in the feature space 
as Kij = { 2 p{xi), 2 p{xj)) Sind Vi = { 2 p{x), 2 p{xi)). Translation- 
invariant kernels are a widely-used family of kernel functions, 
where the inner product {'ip{xi),'ip{xj)) in the feature space 
depends only on the difference ||xi — Xj || between the samples 
in the original space. 

Out-of-sample extensions with RBF kernels as in the pro¬ 
posed method are linked to kernel ridge regression in the 
following way. If the regularization term (a = 0) is omitted 
in the kth dimension of the output vector for the input 
sample x is given by 

fix) = iy’^)^K-^v. ( 22 ) 

If the kernel Kij is set as { 2 p{xi), 2 p{xj)) = (l){\\xi — Xj\\/a) 

with the RBF kernel used in interpolation, one can observe 
from (rf\ that the kernel matrix K coincides with the matrix 
when a constant scale parameter a is chosen for dimension 
k of the interpolation function. Defining v similarly with the 
RBF kernel 0, the interpolation function in ( pT| ) can be written 
as f^{x) = The coefficients of the interpolation 

function being given by we obtain 

fix) = f)^v = 


which is the same as the result obtained with kernel ridge 
regression in ( |22| ). 

We thus observe that fitting an RBF interpolation function 
for manifold embeddings is the equivalent of learning a kernel 
ridge regression model (with no regularization) such that the 
output values yf are the coordinates of data samples in the 
computed embedding. Therefore, the studied out-of-sample 
extension setting can be regarded as a kernel ridge regression 
adapted particularly to manifold-structured data. Indeed, in the 
general and traditional regression setting for classification, no 
assumption is made about the structure of data, and the output 
vectors yi are taken as the class labels. Taking yi's simply as 
the class labels of data transmits only the class information 
to the regression algorithm and conveys no information about 
the geometric properties of data. Meanwhile, first computing 
an embedding with a supervised manifold learning algorithm 
and then learning the regression model on the coordinates 
y^ of data in (instead of taking ^^’s directly as class 
labels) allows the classifier to be guided by the special 
geometric structure of data samples concentrated around class- 
representative manifolds. Coordinates learned with supervised 
manifold learning algorithms reinforce the class information of 
data by enhancing the separability between the classes, while 
the manifold structure of data is also preserved in each class. 

V. Experimental results 

In this section, we evaluate the performance of the proposed 
method in classification experiments. We apply the presented 
out-of-sample extension algorithm on two different supervised 
manifold learning methods. First, we consider the supervised 
Laplacian eigenmaps algorithm presented in |[^, which com¬ 
putes an embedding by solving O- Next, we evaluate our 
algorithm on embeddings obtained with the Fisher-like objec¬ 
tive function in (|^, which is used by methods such as 0 ,®, 
Q, and 1^ . However, we compute a nonlinear embedding 
by removing the linear projection constraint X, so 

that the out-of-sample extension problem is of interest. 

We compare the following methods in the experiments, the 
first four of which provide out-of-sample extension solutions 
for manifold embeddings. When testing the out-of-sample 
extension methods, class labels of test images are assigned 
with nearest-neighbor classification in the low-dimensional 
domain of embedding. 

• Proposed semi-supervised out-of-sample interpolation 
method (SOSI) 

• RBF fitting: An RBF interpolation function is fitted only 
to the training samples, which is the equivalent of the 
interpolation function /i computed at the end of iteration 
r = 1 in Algorithm Test images x are then mapped to 

via the function fi{x). 

• Locally linear embedding (LLE): Test points in are 
mapped to with an adaptation of the LLE algorithm 
0 to the out-of-sample problem. Given a test point 
X e first its approximation is computed as a linear 
combination of its nearest neighbors in AV with weights 
adding up to 1 as in LLE. The point x is then mapped 
to ^ G as the linear combination of the embeddings 
of the same neighbors with the same weights. 
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(a) Supervised Laplacian eigenmaps (b) Fisher-based embedding 





(a) Yale face database 



SB □ 




□ □ 


(b) ETH-80 object database 


Fig. 3. Three-dimensional embeddings of the Yale face data set obtained 
with the two manifold learning methods used in the experiments 

• NyStrom: The original Nystrom formula is not applicable 
since the data-dependent kernel depends on the class la¬ 
bels as discussed in Section III-BI We thus use a modified 
version of the Nystrom method, where f^{x) is taken 
as a linear combination of the embedding coordinates yf 
weighted by the kernel as in The kernel M in the 
formula is taken as the same type of kernel (Gaussian 
kernel) used in the construction of the within-class and 
between-class weight matrices and and it is 
normalized for each test sample so that the kernel values 
M{x,Xi) sum up to 1. 

• Nearest neighbor classification in the original data space 

• SVM in the original data space 

• Semi-supervised learning (SSL) using Gaussian fields: 
Since the proposed out-of-sample extension method can 
be regarded as a building block of a semi-supervised clas¬ 
sifier, we also compare our results with those of a semi- 
supervised classification method. We test the performance 
of SSL with the algorithm proposed in 1^ , which is 
a state-of-the-art semi-supervised classifier based on the 
computation of a smooth function on the data graph that 
coincides with the class labels when evaluated at data 
samples of known class labels. 

We first evaluate the proposed method on a data set 
consisting of the face images of 12 individuals from the 
extended Yale face database p7| , which includes 58 images 
of each individual taken under different poses and illumination 
conditions. The images are normalized, converted to grayscale 
and downsampled to a resolution of 17 x 20 pixels. A sample 
image of each subject in the data set is shown in Figure [4(a) 
The supervised Laplacian eigenmaps and the Fisher-based 
embedding algorithms are used to map the data (17 x 20-pixel 
images) to The weight parameter is set as /i = 0.01 in the 
supervised Laplacian eigenmaps method. Figure shows the 
embeddings of a subset of the data set containing 10 labeled 
images of each individual, computed with the supervised 
Laplacian and the Fisher-based embedding algorithms. Only 
the first three dimensions of the coordinates are plotted for 
illustration. It can be observed that both methods compute 
representations with an enhanced separation between different 
classes. The supervised Laplacian eigenmaps method yields 
an even distribution of different classes across different di- 




(c) COIL-20 object database 


mensions. Since each dimension of the embedding renders 
several pairs of classes separable, sufficiently many class pairs 
contribute to the total directional derivative D{f^) in ([^ for 
each dimension k. This causes the variations of D{f^) and 
G( f^ ) with the scale parameter to be as discussed in Section 
such that G{f^) increases at a faster rate than D{f^) 


III-B 


at large scales due to overfitting. Thus, for the embeddings 
obtained with supervised Laplacian eigenmaps, we optimize 
the scale parameters by minimizing the regularization term 
R{f) as in 00 Meanwhile, the embedding computed with 
the Fisher-based objective yields a more “polarized” represen¬ 
tation, where each dimension of the embedding is observed 
to separate out only one class from the others. When there 
are not sufficiently many separable class pairs in the 

estimation of the variation of this term with the scale parameter 
may become unreliable or biased by a particular class in each 
dimension. We have observed that, when the embedding is 
computed with the Fisher-based objective, the variation of 
D{f^) with the scale parameter is closer to that of G{f^) 
(in comparison with supervised Laplacian eigenmaps). The 
choice of the regularization term R{f) as a linear combina¬ 
tion of these two terms may then lose its reliability, as it 
may become a monotonic function of the scale parameter, 
for instance. Therefore, for the Fisher-based embedding, we 
apply a slightly modified procedure for optimizing the scale 
parameters, where we choose a sufficiently large value for the 
scale parameter in each dimension, which ensures, however, 
that the D{f^)/G{f^) ratio stays above a certain threshold 
value. The scale parameters of the RBF fitting method are set 
as equal to those of the proposed SOSI algorithm. Figure 
shows the classification errors obtained with all methods for 
the supervised Laplacian and the Fisher-based embeddings. 
Each curve displays the misclassification rate (in percentage) 


^Occasionally, the scale parameter of one dimension or a few dimen¬ 
sions k may diverge from the scale parameters of the rest of the dimensions, 
which may cause instabilities. In order to avoid this, we bound the final values 
of the scale parameters to an interval of two standard deviations around their 
mean value averaged over all dimensions. 
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(a) Supervised Laplacian eigenmaps (b) Fisher-based embedding 


Fig. 5. Misclassification rates of face images from Yale database 



(a) Supervised Laplacian eigenmaps (b) Fisher-based embedding 


Fig. 6. Misclassification rates of object images from ETH-80 database 


of unlabeled images, obtained by varying the ratio between the 
number of labeled and unlabeled images in the data set. The 
results are the average of 5 repetitions of the experiment by 
randomly choosing the labeled samples. An early stopping rule 
is applied in the SOSI algorithm for the leftmost point of the 
curve (the labeled/unlabeled ratio of 0.11) due to the relatively 
high error, where the interpolation function construction is 
terminated when around 80% of the unlabeled points are added 
as RBF kernel centers. It is observed that the proposed method 
outperforms the other out-of-sample extension methods in 
comparison, as well as the SVM classifier and the semi- 
supervised graph-based classifier. 

We then repeat the same experiment on two different 
databases of object images captured under varying viewpoints. 
The first experiment is conducted on the images of 8 objects 
from the ETH-80 database 1^ , where 41 images are available 
for each object (in particular, the images of the first object in 
each object category are used so that the images in each class 
belong to the same manifold). A sample image of each object 
is shown in Figure [4(b)| The images are normalized, converted 
to grayscale, and downsampled to a resolution of 20 x 20 
pixels. An embedding of dimension d = 15 is computed 
with the supervised Laplacian eigenmaps and the Fisher-based 
manifold learning algorithms. The second experiment is done 
on the images of 20 objects from the COIL-20 database 
with 71 images for each object, which are normalized, 
converted to grayscale, and downsampled to a resolution of 
32 X 32 pixels. Figure |4(c)| shows a sample image for each 
object. The images are embedded in a space of dimension 
d = 25. In both experiments, the optimization of the scale 
parameters is done as in the previous experiment. The results 
obtained with the two object data sets are presented in Figures 




(a) Supervised Laplacian eigenmaps (b) Fisher-based embedding 


Fig. 7. Misclassification rates of object images from COIL-20 database 

and |7] The misclassification rates of unlabeled samples are 
plotted with respect to the ratio between the number of labeled 
and unlabeled samples. The results are the average of 5 random 
partitionings of the data set. As the classification error of the 
ETH-80 database is relatively high, an early stopping rule 
is applied for this data set by terminating the interpolation 
function construction when around 70% of the unlabeled 
samples with the highest confidence scores are added as RBF 
kernel centers. The results show that the proposed method 
often yields the smallest classification error in the experiment 
of Figure while it is outperformed only by the semi- 
supervised learning method in Figure [7] This graph-based 
semi-supervised learning algorithm performs particularly well 
on the COIL-20 data set, due to the dense sampling and the 
regular structure of the object image manifolds. 

The overall consideration of these experiments shows that 
the proposed out-of-sample extension method for supervised 
manifold learning provides a better performance than the 
reference out-of-sample extension strategies in comparison, 
while it can provide an alternative solution for semi-supervised 
learning when coupled with a supervised dimensionality re¬ 
duction method and thus regarded as one building block of 
a semi-supervised classifier. In particular, one can observe 
in Figures [5][7] that SVM and graph-based SSL may perform 
very differently in different settings. SVM is based purely on 
the representation of the data samples in the original ambient 
space while graph-based SSL only uses the information of 
the similarities between neighboring data samples instead of 
interpreting them as vectors in the high-dimensional space . 
Meanwhile, the proposed method is expected to find a com¬ 
promise between these two approaches, as the interpolation 
function depends both on the coordinates of the data samples 
in and the coordinates of the embedding in learned with 
a supervised manifold learning algorithm that relies on the 
graph representation of data. The experimental results seem to 
confirm this expectation, as the proposed classification solution 
attains a reasonably good performance in situations where 
SVM or graph-based SSL may fail (as in Figures and 
respectively). 

In the experiments of Figures m the interpolation func¬ 
tions of the out-of-sample methods other than SOSI are 
constructed using only the training data. The information 
present in the unlabeled data samples is not exploited in the 
construction of these interpolation functions, whereas SOSI 
uses these points to gradually add them as kernel centers of 
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Fig. 8. Misclassification rates obtained with progressive integration of the test 
images in the extended training set. Out-of-sample extensions are computed, 
class labels are assigned, and embeddings are updated with the extended 
training set in each iteration. 

the learned interpolation function. In order to assess the perfor¬ 
mance of the proposed method in the progressive integration 
of the unlabeled data samples in the learning process, we 
do an additional experiment. The proposed SOSI algorithm 
is used to classify unlabeled test images in an iterative way as 
described in Algorithm Then, in order to compare SOSI 
with the other out-of-sample extension methods, for each 
one of these methods, we carry out an iterative classification 
procedure as follows. In each iteration, all test images are 
assigned class labels with nearest-neighbor classification in the 
low-dimensional domain via the out-of-sample generalization 
strategies in comparison, and a confidence score is obtained 
for each test image as in ( p?] ). Then in the next iteration, 
the test images with the highest confidence scores are added 
to the training set with their estimated class labels and a 
completely new embedding of this extended training set is 
computed with the supervised Laplacian eigenmaps and the 
Fisher-based embedding algorithms (thus new coordinates are 
assigned to the original training images as well). The out-of- 
sample extension of this new embedding is then recomputed 
with the tested strategies in comparison, which are used to 
reclassify the test images. In each iteration r, the compared 
methods use the same number of extended training images 
in A' (same as the number of terms in the interpolation function 
of SOSI), while the choice of the extended training set varies 
between the compared methods as a result of the different con¬ 
fidence scores they assign to the test images. This progressive 
procedure is continued until all test images are included in the 
extended training set. The results obtained on the face images 
from the Yale database are presented in Figures |8(a)| and |8(b)[ 
respectively for the supervised Laplacian eigenmaps and the 
Fisher-based embedding algorithms. The image set of each 
subject contains 10 labeled and 48 unlabeled samples in this 
experiment. The misclassification rates obtained throughout 
the iterations are plotted with respect to the ratio L^jN 
between the size of the extended training set (number of RBF 
terms for SOSI) and the size of the original training set. The 
results indicate that the best classification accuracy is achieved 
by the proposed algorithm most of the time. The misclassi¬ 
fication error obtained with the proposed method decreases 
regularly throughout the iterations as the number of terms in 
the interpolation function increases, while the evolution of the 
misclassification error with the other strategies is less regular 


and the error may even increase throughout the iterations. This 
is due to the fact that the strategies other than SOSI compute 
a new embedding of the extended training set from scratch in 
each iteration. The mislabeled data samples in the extended 
training set may then significantly infiuence the computed 
embedding and consequently the class label assignments of the 
next iteration, since the embedding given by the eigenvectors 
of a class-dependent kernel matrix may change dramatically 
even with small errors in the kernel matrix. The proposed 
method does not suffer from this problem, since it preserves 
the original embedding and refines only the interpolation 
function throughout the iterations, which has a regularizing 
effect that better tolerates inaccurate assignments of the class 
labels of test images. We finally note that, among the strategies 
compared in this experiment, the proposed SOSI algorithm 
is the only one that provides an out-of-sample solution for 
manifold learning when Lr > N. 

Finally, we study the infiuence of the scale parameters of 
the interpolation function on the classification performance. 
As discussed in Section III-B the proposed method selects the 
scale parameters by optimizing the regularization term R{f). 
In order to evaluate the effect of this regularization approach 
on the classification accuracy, we compare the variations of the 
regularization term R{f) and the classification error with the 
scale parameter. We compute an embedding of the training 
images with the supervised Laplacian eigenmaps algorithm 
and then construct an RBF interpolation function, where all 
scale parameters af are set to a common a value and the 
coefficients cf are computed to fit the training images and the 
learned embedding for this choice of the scale parameter (as 
in the RBF fitting method or the first iteration of SOSI). A 
sequence of interpolation functions are computed by varying 
the scale parameter a, and for each interpolation function, 
the regularization objective R{f) is computed as well as the 
misclassification rate of the test images. The variations of the 
regularization cost and the misclassification rate with the scale 
parameter a are presented in Figure for all three data sets 
used in the experiments. The results suggest that the regular¬ 
ization objective R{f) has a rather smooth and nonmonotonic 
variation with the scale parameter, which resembles that of the 
classification error. Moreover, the interval of scale parameters 
a minimizing the regularization objective R{f) coincides with 
the range of a values where the misclassification rate takes 
small values. This shows that the proposed regularization 
objective permits the algorithm to capture the influence of the 
scale parameters on the performance of learning and can be 
used for optimizing the scale parameters. 


VI. Conclusions 

We have proposed a method for the out-of-sample exten¬ 
sions of supervised manifold learning algorithms that embed 
a set of class-representative manifolds residing in a high¬ 
dimensional ambient space to a set of manifolds in a lower¬ 
dimensional domain. The proposed out-of-sample generaliza¬ 
tion method is based on the construction of an RBF inter¬ 
polation function, where the parameters of the interpolation 
function are optimized to minimize the embedding error over 
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(a) Yale face data set (b) ETH-80 object data set (c) COIL-20 object data set 


Fig. 9. Variations of the misclassification error and the regularization term R{f) with the scale parameter of the RBF kernels 


a set of initially unlabeled data samples, whose class labels 
are estimated progressively along with the parameters of the 
interpolation function. We have shown that the regularity of the 
interpolation function can be controlled by optimizing the RBF 
scale parameters to minimize a regularization objective that 
controls the total gradient of the interpolation function while 
encouraging sufficiently strong derivatives along the directions 
of class separation boundaries in order to ensure an effective 
separation between different classes. The proposed out-of- 
sample generalization method outperforms baseline interpo¬ 
lation solutions in classification applications. Experimental 
results suggest that the proposed algorithm achieves state- 
of-the-art performance in semi-supervised learning and can 
be effectively used along with supervised manifold learning 
methods in the classification of low-dimensional data sets 
consisting of labeled and unlabeled data samples. 
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